
Cell Genomics

Elsevier BV

All preprints, ranked by how well they match Cell Genomics's content profile, based on 162 papers previously published here. The average preprint has a 0.22% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Low-pass Whole Genome Imputation Enables the Characterization of Polygenic Breast Cancer Risk in the Indigenous Arab Population

Al-Jumaan, M.; Chu, H.; Al-Sulaiman, A.; Camp, S. Y.; Han, S.; Gillani, R.; Al Marzooq, Y.; Almulhim, F.; Vatte, C.; Al Nemer, A.; Almuhanna, A.; Van Allen, E. M.; Al-Ali, A.; AlDubayan, S. H.

2022-12-09 genetic and genomic medicine 10.1101/2022.12.07.22282785 medRxiv
Top 0.1%
44.7%

The indigenous Arab population has traditionally been underrepresented in cancer genomics studies, and as a result the polygenic risk landscape of breast cancer in the population remains elusive. Here we show that, by utilizing low-pass whole genome sequencing (lpWGS), we can accurately impute population-specific variants with high exome concordance (median dosage correlation: 0.9459, interquartile range: 0.9410-0.9490) and construct breast cancer burden-sensitive polygenic risk scores (PRS) using publicly available resources. After adjusting the PRS to the Arab population, we found significant associations between PRS performance in risk prediction and first-degree relative breast cancer history prediction (Spearman rho = 0.43, p = 0.03), where breast cancer patients in the top PRS decile are 5.53 (95% CI: 1.76-17.97, p = 0.003) times more likely to also have a first-degree relative diagnosed with breast cancer compared to those in the middle deciles. In addition, we found evidence for the genetic liability threshold model of breast cancer: among patients with a family history of breast cancer, pathogenic rare variant carriers had significantly lower PRS than non-carriers (p = 0.0205, Mann-Whitney U test), while for non-carriers every standard deviation increase in PRS corresponded to a 4.52-year (95% CI: 8.88-0.17, p = 0.042) earlier age of presentation. Overall, our study provides a viable strategy utilizing lpWGS to assess polygenic risk in an understudied population and takes steps toward addressing existing global health disparities.
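
For readers unfamiliar with the quantities in this abstract, a PRS is commonly computed as a weighted sum of imputed allele dosages and then expressed in standard deviation units. The sketch below is a generic illustration under that assumption, not the authors' pipeline; the function names `polygenic_score` and `standardize` and the example numbers are hypothetical.

```python
import statistics


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (each between 0 and 2) for one person.

    `weights` are per-variant effect sizes (hypothetical values here).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def standardize(scores):
    """Convert raw scores to z-scores within a reference group.

    Assumes the group has at least two distinct scores, so the
    standard deviation is non-zero.
    """
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]


# Hypothetical example: three variants scored in three individuals.
weights = [0.1, 0.2, 0.3]
raw = [polygenic_score(d, weights) for d in ([0, 1, 2], [1, 1, 1], [2, 0, 0])]
z_scores = standardize(raw)
```

Standardizing within the target population is one simple way to read a "per standard deviation" effect; the adjustment actually used in the study may differ.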

2
Natural variation in gene expression and Zika virus susceptibility revealed by villages of neural progenitor cells

Wells, M. F.; Nemesh, J.; Ghosh, S.; Mitchell, J. M.; Mello, C. J.; Meyer, D.; Raghunathan, K.; Tegtmeyer, M.; Hawes, D.; Neumann, A.; Worringer, K. A.; Raymond, J. J.; Kommineni, S.; Chan, K.; Ho, D.; Peterson, B. K.; Piccioni, F.; Nehme, R. F.; Eggan, K.; McCarroll, S. A.

2021-11-09 genomics 10.1101/2021.11.08.467815 medRxiv
Top 0.1%
41.1%

Variation in the human genome contributes to abundant diversity in human traits and vulnerabilities, but the underlying molecular and cellular mechanisms are not yet known, and will need scalable approaches to accelerate their recognition. Here, we advanced and applied an experimental platform that analyzes genetic, molecular, and phenotypic heterogeneity across cells from very many human donors cultured in a single, shared in vitro environment, with algorithms (Dropulation and Census-seq) for assigning phenotypes to individual donors. We used natural genetic variation and synthetic (CRISPR-Cas9) genetic perturbations to analyze the vulnerability of neural progenitor cells (NPCs) to infection with Zika virus. These analyses identified a common variant in the antiviral IFITM3 gene that regulated IFITM3 expression and explained most inter-individual variation in NPC susceptibility to Zika virus infection. These and other approaches could provide scalable ways to recognize the impact of genes and genetic variation on cellular phenotypes. Highlights: (1) Measuring cellular phenotypes in iPSCs and hPSC-derived NPCs from many donors; (2) Effects of donor sex, cell source, genetic and other variables on hPSC RNA expression; (3) Natural genetic variation and synthetic perturbation screens both identify IFITM3 in NPC susceptibility to Zika virus; (4) A common genetic variant in IFITM3 explains most inter-individual variation in NPC susceptibility to Zika virus.

3
Replicating the Association of Variants in BSN and APBA1 with Obesity in Diverse Populations

Robinson, J. R.; Denny, J. C.; Zeng, C.

2024-08-22 genetic and genomic medicine 10.1101/2024.08.21.24312322 medRxiv
Top 0.1%
40.0%

In a recent study by Zhao et al., rare protein-truncating variants (PTVs) in the BSN and APBA1 genes showed effects on obesity that exceeded those of well-known genes such as MC4R in a UK cohort. In this study, we leveraged the All of Us Research Program to investigate the association of predicted loss-of-function (pLoF) PTVs in BSN and APBA1 with body mass index (BMI) across a population of diverse ancestry. Our analysis revealed that the impact of pLoF variants in BSN and APBA1 on BMI was notably greater in this cohort, especially among individuals of European ancestry. Additionally, a phenome-wide association study (PheWAS) using the extensive phenotypic data available in the All of Us Research Program uncovered novel associations of BSN and APBA1 heterozygous pLoF carriers with various phenotypes. Specifically, BSN pLoF variants were associated with pulmonary hypertension, atrial fibrillation, and anticoagulant use, while APBA1 pLoF variants were linked to disorders of the temporomandibular joint. These findings underscore the potential of large-scale biobanks in advancing genetic discovery.

4
Mismatch tolerance of a gRNA for CRISPR-based gene activation confers broad activity critical for cell reprogramming

Reisman, S. J.; Zhu, W.; Miller, S. E.; Halabi, D.; Sangvai, N.; Crawford, G. E.; Gordan, R.; Gersbach, C. A.

2026-02-03 genomics 10.64898/2026.02.01.703129 medRxiv
Top 0.1%
38.4%

CRISPR activation and interference systems (CRISPRa/i) are widely used for programmable transcriptional control. Although these technologies are capable of highly specific single-gene activity, some applications of transcriptional network reprogramming require broad, genome-wide effects. Here, we identify a CRISPRa gRNA that robustly reprograms astrocyte transcriptional state. Unexpectedly, this activity arises from extensive off-target binding that induces expression changes in thousands of genes, unlike neighboring gRNAs targeting the same intended on-target site. We leverage this promiscuous gRNA to dissect determinants of gRNA-driven off-target dCas9 binding in the context of transcriptional reprogramming. Using ChIP-seq, high-throughput protein-binding microarrays, and gRNA-variant library screening in cells, we demonstrate that PAM-proximal bases are primary determinants of genomic binding, mismatch tolerance is both gRNA- and base-specific, and targeted mutations within the PAM-proximal region can tune gRNA specificity. We further demonstrate that CRISPRa-driven phenotypes can reflect combined contributions from widespread off-target activity and dose-dependent on-target effects. These findings highlight the potentially widespread impacts of CRISPRa off-target activity, underscore the need to account for cryptic effects when selecting and evaluating gRNAs for programming cell phenotypes, and demonstrate that multi-site binding by CRISPRa systems can be exploited as a feature for network-level perturbations in cell reprogramming.

5
Accurate, sensitive, and efficient chromatin accessibility quantification at target loci using UNIChro-seq.

Kono, M.; Hatano, H.; Asahara, K.; Nakano, M.; Bagherzadeh, R.; Kawashima, T.; Arakawa, T.; Sato, M.; Inokuchi, H.; Nishino, T.; Itamiya, T.; Takahashi, H.; Natsumoto, B.; Suzuki, A.; Yamamoto, K.; Ishigaki, K.

2025-07-29 genetic and genomic medicine 10.1101/2025.07.29.25332340 medRxiv
Top 0.1%
37.9%

Recent progress in statistical and experimental fine mapping of disease risk variants prompts us to focus on specific target loci for functional investigation. However, current genetics is hindered by a limited toolbox for target-loci analysis. To address this, we developed UNIChro-seq, a method that digitally counts accessible chromatin molecules at target loci. UNIChro-seq allows for accurate, sensitive, and efficient quantification of allelic effects compared to conventional methods. Using UNIChro-seq, we investigated the effects of 57 autoimmunity risk alleles on chromatin accessibility and estimated the causal effects of 20 artificial variants generated through genome editing. As a caveat, a non-negligible fraction of the edited alleles exhibited a false-positive effect on chromatin accessibility, which can be effectively distinguished from the true causal effect through bi-directional genome editing. Finally, functional dissection of a fine-mapped risk variant at the LEF1 locus illuminated its impact on T cell pathology in rheumatoid arthritis. Together, these findings underscore the utility of combining UNIChro-seq with genome editing technology to enable precise and scalable functional analysis of disease-associated loci.

6
Concordance and dissonance: A genome-wide analysis of self-declared versus inferred ancestry in 10,250 participants from the HostSeq cohort

Warren, R. L.; Birol, I.

2025-06-13 genomics 10.1101/2025.06.10.658783 medRxiv
Top 0.1%
37.7%

Accurate characterization of human diversity is foundational to equitable genomics. In this study, we analyzed self-declared and genome-derived ancestry in 10,250 participants from the pan-Canadian HostSeq cohort. Using the alignment-free ntRoot algorithm on whole genome sequencing data, we inferred global and local ancestry at the continental super-population level and compared these with self-reported sociocultural identity categories. We observed high concordance among individuals self-identifying as White (98.8%), Black (97.2%), East Asian (96.1%), and South Asian (89.9%). Concordance was lower among those self-identifying as Hispanic (74.6%), Middle Eastern / Central Asian (67.9%), or Indigenous (40.7%), reflecting greater admixture complexity. Agreement between expected and inferred ancestry labels was modest (Cohen's kappa: κ = -0.01 unweighted; 0.35 weighted), and ancestry discordance was strongly associated with higher Shannon entropy of ancestry fractions. Principal component analysis of ntRoot-derived ancestry composition revealed tightly clustered profiles in some groups and broader, overlapping distributions in others, illustrating how sociocultural identities and genomic data capture distinct but intersecting dimensions of human diversity. These findings support the complementary use of genome-derived continental ancestry fractions alongside self-identification, particularly in settings where sociocultural labels may be incomplete, heterogeneous, or poorly aligned with genetic background. This approach can improve scientific rigor and enhance inclusion in population-scale genomics while respecting the social meaning of identity. We emphasize that genetic ancestry estimates are not proxies for race, which is a social construct with no biological basis.
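
The two summary statistics named in this abstract, Shannon entropy of ancestry fractions and unweighted Cohen's kappa, have standard textbook definitions. The sketch below implements those definitions only; it is not the ntRoot workflow, and the function names and example values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(fractions):
    """Shannon entropy in bits of ancestry fractions that sum to 1.

    A single-ancestry profile gives 0; an even split gives the maximum.
    """
    return sum(-p * math.log2(p) for p in fractions if p > 0)


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equal-length label lists.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when both lists contain one identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical example: one admixed profile and two label vectors.
entropy_bits = shannon_entropy([0.5, 0.5])
kappa = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "A", "B"])
```

A kappa near zero, like the unweighted value reported above, means agreement is no better than expected by chance given the label frequencies, even if raw concordance in some groups is high.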

7
Biological machine learning combined with bacterial population genomics reveals common and rare allelic variants of genes to cause disease

Bandoy, D. D. R.; Weimer, B. C.

2019-08-20 genomics 10.1101/739540 medRxiv
Top 0.1%
36.8%

High-dimensional data generated from bacterial whole genome sequencing provide an unprecedented scale of information that requires appropriate statistical frameworks of analysis to infer biological function from bacterial genomic populations. Applying genome-wide association study (GWAS) methods to bacterial population genomics is an emerging approach that yields a list of genes associated with a phenotype, but with undefined importance among the candidates in the list. Here, we validate the combination of GWAS, machine learning, and pathogenic bacterial population genomics as a novel scheme to identify SNPs and rank allelic variants to determine associations for accurate estimation of disease phenotype. This approach parsed a dataset of 1.2 million SNPs, resulting in a ranked importance of associated alleles of Campylobacter jejuni porA using multiple spatial locations over a 30-year period. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This approach, termed BioML, defined intestinal and extraintestinal groups that have differential allelic variants that cause abortion. Divergent variants containing indels that defeated gene callers were rescued using biological context and knowledge, which resulted in defining rare and divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled to GWAS and population genomics to simultaneously identify and rank alleles to define their role in abortion and, more broadly, infectious disease.

8
A novel functional genomics atlas coupled with convolutional neural networks facilitates clinical interpretation of disease relevant variants in non-coding regulatory elements

Deng, R.; Perenthaler, E.; Nikoncuk, A.; Yousefi, S.; Lanko, K.; Schot, R.; Maresca, M.; Parker, M. J.; van Ijcken, W. F. J.; Park, J.; Sturm, M.; Haack, T. B.; Genomics England Research Consortium; Roshchupkin, G. V.; Mulugeta, E.; Barakat, T. S.

2024-04-16 genetic and genomic medicine 10.1101/2024.04.13.24305761 medRxiv
Top 0.1%
33.0%

Genome-wide assessment of genetic variation is becoming routine in human genetics, but functional interpretation of non-coding variants both in common and rare diseases remains extremely challenging. Here, we employed the massively parallel reporter assay ChIP-STARR-seq to functionally annotate the activity of >145 thousand non-coding regulatory elements (NCREs) in human neural stem cells (NSCs), modelling early brain development. Highly active NCREs show increased sequence constraint and harbour de novo variants in individuals affected by neurodevelopmental disorders. They are enriched for transcription factor (TF) motifs including YY1 and p53 family members and for primate-specific transposable elements, providing insights on gene regulatory mechanisms in NSCs. Examining episomal NCRE activity of the same sequences in human embryonic stem cells identified cell type differential activity and primed NCREs, accompanied by a rewiring of the epigenome landscape. Leveraging the experimentally measured NCRE activity and nucleotide composition of the assessed sequences, we built BRAIN-MAGNET, a functionally validated convolutional neural network that predicts NCRE activity based on DNA sequence composition and identifies functionally relevant nucleotides required for NCRE function. The application of BRAIN-MAGNET allows fine-mapping of GWAS loci identified for common neurological traits and prioritizing of possible disease-causing rare non-coding variants in currently genetically unexplained individuals with neurogenetic disorders, including those from the Genomics England 100,000 Genomes project, identifying novel enhanceropathies. We foresee that this NCRE atlas and BRAIN-MAGNET will help reduce missing heritability in human genetics by limiting the search space for functionally relevant non-coding genetic variation.

9
Genomic Disaggregation Reveals Distinct Admixture Patterns and Cardiometabolic Risk Loci in Black Hawaiians

Vand, K.; Badia, N.; Khotchouk, B.

2026-01-26 genomics 10.64898/2026.01.24.701518 medRxiv
Top 0.1%
32.9%

Background: The systematic aggregation of distinct admixed subpopulations into broad racial categories creates genomic blind spots that undermine the promise of precision medicine. Black Hawaiians (BH) exemplify this exclusion. Characterized by a unique tri-continental ancestry (African, European, and Native Hawaiian/Pacific Islander [NHPI]) and disproportionate cardiometabolic burden, their population-specific risk drivers remain masked by systematic conflation with broader ancestral cohorts. Methods: We performed the first comprehensive genomic analysis of 287 BH participants from the NIH All of Us Research Program using whole-genome sequencing (WGS). Following haplotype phasing (SHAPEIT5), we characterized population structure (ADMIXTURE, PCA), inferred local ancestry tracts (RFMix), and reconstructed demographic history (SMC++). Genome-wide allele frequency differentiation (AFD) was calculated against tri-continental reference panels, and Electronic Health Record (EHR) data were integrated to quantify the population's cardiometabolic burden. Results: The cohort exhibited complex tri-continental admixture (mean: 67.0% African, 22.1% European, 10.9% NHPI) with high inter-individual heterogeneity. Phenotypic analysis confirmed a substantial disease burden (34.8% hypertension, mean BMI 31.2 kg/m²), while SMC++ reconstruction revealed a sharp demographic bottleneck in recent generations. Genome-wide AFD analysis of 8.9M variants demonstrated systematic differentiation (mean Δ vs African: 0.041, NHPI: 0.069, European: 0.084). The top 100 differentiated variants mapped to 31 unique genes, identifying distinct candidates including MYO9A, RAB37, and PEAR1. Notably, differentiation in the cytoskeletal regulator MYO9A suggests a mechanostructural etiology for kidney disease distinct from classical APOL1 cytotoxicity, while PEAR1 variants implicate population-specific pharmacogenomic resistance to antiplatelet therapy.
Conclusion: This study highlights the critical necessity of data disaggregation in genomic research, using the Black Hawaiian population as a paradigmatic example. By distinguishing this community from broader aggregate groups, we uncovered a distinct genomic architecture with unique admixture patterns that drive specific cardiometabolic risks. These findings demonstrate the necessity of granular resolution for achieving equitable precision medicine.
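
The "mean Δ" values in this abstract summarize allele frequency differentiation against each reference panel. As a rough illustration, the sketch below assumes biallelic variants and takes AFD per variant as the absolute difference in alternate-allele frequency, averaged genome-wide; the abstract does not spell out the exact definition used, and `mean_afd` and the example frequencies are hypothetical.

```python
def mean_afd(freqs_cohort, freqs_reference):
    """Mean absolute allele frequency difference across matched variants.

    Both inputs hold alternate-allele frequencies (0..1) for the same
    variants in the same order. Assumes biallelic sites.
    """
    if not freqs_cohort or len(freqs_cohort) != len(freqs_reference):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - b) for a, b in zip(freqs_cohort, freqs_reference))
    return total / len(freqs_cohort)


# Hypothetical example: two variants compared against one reference panel.
delta = mean_afd([0.10, 0.50], [0.20, 0.30])
```

Under this definition a smaller mean Δ indicates a panel whose allele frequencies sit closer to the cohort's, which is consistent with the African panel showing the lowest value for a cohort with predominantly African ancestry.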

10
Whole-genome sequencing pilot of the Central Asian Genomic Diversity Project reveals distinct histories, adaptation, and introgression

He, G. G.; Su, H.; Sun, Q.; Tang, R.; Yang, Q.; Luo, L.; Zhong, J.; Sabitov, Z.; Cheng, J.; Bu, F.; Lu, Y.; Liu, C.; Yuan, H.; Wei, L.-H.; Zhabagin, M.; Wang, M.

2025-08-28 genetic and genomic medicine 10.1101/2025.08.26.25334450 medRxiv
Top 0.1%
32.8%

The underrepresentation of Central Asian genomic data has constrained our understanding of their demographic history and hindered advancements in precision medicine and health equity. Despite the region's rich historical tapestry, characterized by numerous trans-Eurasian migrations following the advent of agriculture and pastoralism, the genetic contributions of ancient Eurasians to modern Central Asians remain poorly understood. To address this gap, we performed an anthropologically informed Central Asian Genomic Diversity Project and reported the results of pilot whole-genome sequencing work on 166 Central Asians and Afghanistan Hazaras (CAAH) from 20 populations to investigate their demographic history, local adaptation, medical relevance, and archaic introgression. Significant genetic differentiation among CAAH populations was revealed. Tajik, Karluks, Turkmen, and Uzbek individuals exhibited higher proportions of West Eurasian ancestry, whereas the Kyrgyz, Karakalpak, Uyghur, and Hazara populations presented increased ancestry related to ancient Northeast Asians. In contrast, Dungans demonstrated a predominance of East Asian-derived ancestry. Four Turkic-related genetic clusters corresponding to geographic distribution were identified, supporting the "Northeast Asia origin" hypothesis for Turkic groups. Additionally, two Indo-European genetic clines were detected, with Hazaras being notably isolated. Strong genetic affinities were observed between Hazaras and Altaic groups in Siberia and between Dungans and Sino-Tibetan-speaking East Asians, underscoring the impact of ancient long-distance migrations on Eurasian genetic diversity. The recent east-west admixture in CAAH was estimated to have occurred 23-31 generations ago, aligning with the Song and Yuan dynasties and the Mongol Empire period.
The mutation spectra of candidate disease-causing variants and pharmacogenomic genes were characterized, indicating that differentiated demographic histories significantly influence the genetic architecture of diseases among different Central Asians. Differential post-admixture adaptation signatures identified in the four genetically distinct groups have substantial effects on immune, metabolic, neural, and physical traits. Shifts in subsistence strategies significantly shaped the genetic architecture of complex traits in Central Asians. Neanderthal-like sequences exhibited varying phenotypic effects across genetically distinct CAAH strains, including susceptibility to immune and psychiatric conditions in West Eurasian-biased CAAH individuals and drug metabolism in East Eurasian-biased CAAH individuals. Denisovan-like segments were primarily linked to type 2 diabetes, among other conditions. This research on Central Asian genomic diversity enhances the understanding of their evolutionary history and admixture events, promoting health equity and advancing precision medicine initiatives.
He et al. conducted a pilot study on the Central Asian Genomic Diversity Project, utilizing whole-genome sequencing of 166 individuals from 20 Central Asian populations. They identified fine-scale population substructures shaped by complex ancient trans-Eurasian migration and admixture processes. Their comprehensive analysis revealed post-admixture adaptations and archaic introgressions, shedding light on demographic events that influenced medically relevant mutation spectra and adaptations affecting immune, metabolic, neural, and physical traits. Neanderthal introgression segments significantly influence phenotypic traits, including susceptibility to immune and psychiatric disorders, whereas Denisovan-derived sequences have effects on disease susceptibility. This work advances our understanding of Central Asian genomic diversity and evolutionary history as well as their implications for health.

11
Determining susceptibility loci in triple negative breast cancer using a novel pre-clinical model

Simon, S. E.; Simmons, B. W.; Kim, M.; Joseph, S. C.; Korba, E.; Marathe, S. J.; Bohm, M. S.; Mahajan, S.; Bohl, C.; Read, R.; Holt, J.; Hayes, N.; Lu, L.; Williams, R.; Sipe, L.; Ashbrook, D. G.; Makowski, L.

2024-02-10 genetics 10.1101/2024.02.08.579359 medRxiv
Top 0.1%
32.1%

Breast cancer (BC) is the most common cancer and the second leading cause of cancer death in US women. Our lack of understanding of how genetic variants affect molecular mechanisms that mediate BC aggression poses a substantial obstacle to advancements in cancer diagnosis and therapy. To examine the effects of genetic variants on BC traits, a novel murine model was created with robust phenotypic and genomic variation. The FVB C3(1)-T-antigen ("C3Tag") mouse develops spontaneous tumors in the mammary glands of female mice with a mean latency of 4-5 months of age. This genetically engineered mouse model (GEMM) is well established to resemble human basal-like triple negative breast cancer (TNBC). TNBC is an aggressive subtype with few clinical approaches and poor patient outcomes. Thus, to model human heterogeneity in BC outcomes, we systematically crossed the C3Tag GEMM into the BXD recombinant inbred family, the largest and best characterized genetic reference population. The new model is termed "BXD-BC" and F1 hybrids of the cross have isogenic genomes that are reproducible. BXD-BCs are a potent tool to determine the impact of genetic modifiers on BC tumor traits. We hypothesized that examination of BXD-BC GEMMs will enable the identification of susceptibility loci, candidate genes, and molecular networks that underlie variation of multiple BC phenotypes. Using N=29 BXD-BC strains, we demonstrated significant heritable variations in the severity of TNBC characteristics such as tumor latency, multiplicity, and survival. Interestingly, 2 BXD-BC strains never developed tumors out to 1 year of age. Thus, BXD-BC strains demonstrate variance in cancer susceptibility and progression compared to the parent C3Tag GEMM, indicating the presence of genetic modifiers. Through an unbiased systematic quantification of breast cancer severity across BXD-BC hybrids, we identified several significant quantitative trait loci (QTL) and candidate genes for specific tumor traits.
In combination with public human GWAS datasets, we defined syntenic regions, candidate genes, and underlying networks through cross-species systems genetics analyses to demonstrate the translational validity of conserved, biologically relevant, and targetable candidates. Our findings suggest conserved candidates predicting TNBC patient survival. In sum, the BXD-BC resource is an innovative, reliable, and robust preclinical model that reflects robust genetic heterogeneity. Using cutting-edge systems genetics, we have identified genetic modifiers of BC phenotypic variation that could be targeted to overcome therapeutic limitations or used as biomarkers of risk or response to therapy.

12
All of Us diversity and scale improve polygenic prediction contextually with greatest improvements for underrepresented populations

Tsuo, K.; Shi, Z.; Ge, T.; Mandla, R.; Hou, K.; Ding, Y.; Pasaniuc, B.; Wang, Y.; Martin, A. R.

2024-08-06 genomics 10.1101/2024.08.06.606846 medRxiv
Top 0.1%
31.9%

Recent studies have demonstrated that polygenic risk scores (PRS) trained on multi-ancestry data can improve prediction accuracy in groups historically underrepresented in genomic studies, but the availability of linked health and genetic data from large-scale diverse cohorts representative of a wide spectrum of human diversity remains limited. To address this need, the All of Us research program (AoU) generated whole-genome sequences of 245,388 individuals (release v7) who collectively reflect the diversity of the USA. Leveraging this resource and another widely used population-scale biobank, the UK Biobank (UKB) with a half million participants, we developed PRS trained on multi-ancestry and multi-biobank data with up to ~750,000 participants for 32 common, complex traits and diseases across a range of genetic architectures. We then evaluated effects of ancestry, PRS methodology, and genetic architecture on PRS accuracy across a held-out subset of ancestrally diverse AoU participants. Overall, we found that the increased diversity of AoU significantly improved PRS performance in some participants in AoU, especially underrepresented individuals, across multiple phenotypes. Notably, maximizing sample size by combining discovery data across AoU and UKB is not the optimal approach for predicting some phenotypes, particularly in African ancestry populations; rather, using data from only AoU for these traits resulted in the greatest accuracy. This was especially true for less polygenic traits with large ancestry-enriched effects and larger heritability estimates in African ancestry populations, such as neutrophil count (R²: 0.055 vs. 0.035 using AoU vs. cross-biobank meta-analysis, respectively, because of e.g. DARC). Lastly, we calculated individual-level PRS accuracies rather than grouping by continental ancestry, a critical step towards interpretability in precision medicine.
Individualized PRS accuracy decays linearly as a function of ancestry divergence, but the slope was smaller using multi-ancestry GWAS compared to using European GWAS. Our results highlight the potential of biobanks with more balanced representations of human diversity to facilitate more accurate PRS for the individuals least represented in genomic studies.

13
Genome writing and Targeted Delivery of the NKX6-3/ANK1 gene cluster and its Type 2 Diabetes GWAS Variants to Human iPSCs

Chalhoub, N.; Varshney, A.; Zhang, W.; Uhl, S.; Laurent, J. M.; Mcloughlin, C.; Ashe, H.; Dale, N.; Mou, X.; Ramnarine, K.; Goldberg, J.; Paull, D.; Maurano, M. T.; Brosh, R.; Fenyo, D.; Cipriani, F.; Parker, S.; Boeke, J. D.

2026-01-05 genetics 10.64898/2026.01.04.697539 medRxiv
Top 0.1%
28.7%

Genome-wide association studies (GWAS) have identified over 600 loci containing single-nucleotide polymorphisms (SNPs) associated with type 2 diabetes (T2D), most of which reside in non-coding regions. Among the set of T2D SNPs, linking causal genome variants to disease risk experimentally has remained a challenge; however, advances in synthetic mammalian genome writing techniques now enable the delivery of multiple haplotypes to human induced pluripotent stem cells (hiPSCs) to create a series of isogenic cell lines that can be differentiated and phenotyped in vitro. Here, to begin efforts in dissecting a T2D GWAS locus, we engineered an NKX6-3/ANK1 gene cluster knockout hiPSC line and introduced a landing pad facilitating the delivery of synthetic haplotype payloads. We built four haplotypes, including several that are not observed in nature, containing risk SNPs spanning the NKX6-3/ANK1 gene cluster using a method called "variant Switching Auxotrophic markers for Integration" (vSwAP-In), and integrated them precisely into hiPSCs. NKX6-3/ANK1 deletion blocked pancreatic progenitor and skeletal muscle differentiation, suggesting that NKX6-3 and ANK1 are required for early pancreatic and skeletal muscle development, and perhaps related to the existence of two nonoverlapping sets of SNPs in linkage disequilibrium that associate with the expression of the two adjacent genes. When NKX6-3/ANK1 T2D "Risk" haplotypes were reintroduced, skeletal muscle and pancreatic progenitor differentiation capabilities were restored. ANK1 expression was elevated in the ANK1 Risk and All-Risk haplotypes compared to the NKX6-3 Risk and Non-Risk haplotypes, establishing a functional experimental platform to examine risk SNP clusters in their native contexts. Overall, this work establishes a platform for the dissection of GWAS loci using synthetic haplotype genomics in hiPSCs.
Significance Statement: Genome-wide association studies have been used to identify disease-associated SNPs; however, most SNPs lie in non-coding regions, making functional experimentation difficult to perform. Using vSwAP-In, a yeast-based DNA variant-building method, and mSwAP-In, a mammalian genome engineering approach, we establish a platform for functional GWAS dissection in hiPSCs. This platform allows us to build DNA harboring virtually any combination of disease-risk SNPs, allowing for functional characterization of SNPs without the limitations of linkage disequilibrium. We demonstrate this approach using a Type 2 diabetes GWAS gene cluster, NKX6-3/ANK1.

14
Transcriptional Determinism and Stochasticity Contribute to the Complexity of Autism Associated SHANK Family Genes

Lu, X.; Ni, P.; Suarez-Meade, P.; Yu, M.; Forrest, E. N.; Wang, G.; Wang, Y.; Quinones-Hinojosa, A.; Gerstein, M.; Jiang, Y.-H.

2024-03-19 genetics 10.1101/2024.03.18.585480 medRxiv
Top 0.1%
28.5%

Precision of transcription is critical because transcriptional dysregulation is disease causing. Traditional methods of transcriptional profiling are inadequate to elucidate the full spectrum of the transcriptome, particularly for longer and less abundant mRNAs. SHANK3 is one of the most common autism causative genes. Twenty-four Shank3 mutant animal lines have been developed for autism modeling. However, their preclinical validity has been questioned due to incomplete Shank3 transcript structure. We applied an integrative approach combining cDNA-capture and long-read sequencing to profile the SHANK3 transcriptome in humans and mice. We unexpectedly discovered an extremely complex SHANK3 transcriptome. Specific SHANK3 transcripts were altered in Shank3 mutant mice and postmortem brain tissues from individuals with ASD. The enhanced SHANK3 transcriptome significantly improved the detection rate for potentially deleterious variants from genomics studies of neuropsychiatric disorders. Our findings suggest stochastic transcription of the genome associated with SHANK family genes.

15
Long-read transcriptome assembly reveals vast transcriptional complexity in the placenta associated with metabolic and endocrine function

Bresnahan, S. T.; Yong, H.; Wu, W. H.; Lopez, S.; Chan, J. K. Y.; White, F.; Jacques, P.-E.; Hivert, M.-F.; Chan, S.-Y.; Love, M. I.; Huang, J. Y.; Bhattacharya, A.

2025-12-29 genomics 10.1101/2025.06.26.661362 medRxiv
Top 0.1%
28.4%

The placenta is critical for fetal development and mediates the effects of pregnancy complications on offspring metabolic health, yet it is often poorly characterized in genomic studies. Existing transcriptomic analyses rely on adult tissue-based references, which overlook developmentally important isoform diversity. We used largest-in-class long-read RNA-seq (N=72) to create a comprehensive placental transcriptome reference, identifying 37,661 high-confidence isoforms (14,985 novel) across 12,302 genes (2,759 novel). Contrary to characterizations of the placenta as a "transcriptomic void," we found transcriptional breadth and complexity comparable to adult tissues, with extraordinary splicing diversity in genes controlling obesity, lactogen production and growth, including 108 distinct CSH1 (placental lactogen) isoforms. This improved reference offers two advantages: First, it reduced inferential uncertainty in isoform quantification by 30% and increased the yield of high-confidence transcripts. Applying this reference to short-read RNA-seq datasets (N=344) of gestational diabetes mellitus (GDM), we found that placental transcription mediated 36% of GDM effects on birth weight, with novel CSH1 isoforms identified as key mediators. We further uncovered ancestry-specific effects, with distinct CSH1 isoforms mediating larger effects in European (24.4%) than Asian (13.4%) populations. Our results establish that utilizing long-read-based, tissue-specific transcriptomic annotations is critical, enabling isoform-resolved analyses that provide greater sensitivity than conventional gene-level approaches for understanding placental function and context-specific variation across diverse biobanks.

16
Phenome-derived polygenic scores and social determinants jointly shape context-dependent disease risk

Wang, Y.; Truong, B.; Lu, W.; Fadil, C.; He, Y.; Luo, W.; Koyama, S.; Tsuo, K.; Paruchuri, K.; Yu, Z.; Hull, L. E.; Zheng, Z.; Carey, C. E.; Walters, R. K.; Neale, B. M.; Robinson, E. B.; Kraft, P.; Natarajan, P.; Martin, A. R.

2026-04-18 genetic and genomic medicine 10.64898/2026.04.16.26351039 medRxiv
Top 0.1%
26.7%

Polygenic scores (PGS) are typically derived from single-trait genome-wide association studies (GWAS), yet many complex diseases arise from shared genetic liability distributed across correlated clinical dimensions. Accordingly, disease risk depends not only on how genetic liability is represented but also on the social context in which that liability is expressed. Whether phenome-derived latent factors improve prediction, and how social determinants of health (SDoH) modify the realized utility of PGS, remains unclear. Here we constructed PGS for 35 orthogonal latent phenomic factors derived from 2,772 phenotypes in 361,114 UK Biobank (UKB) participants and evaluated their phenomic specificity, cross-dataset portability and predictive performance relative to conventional disease-specific PGS across the UKB holdout, Mass General Brigham Biobank and the All of Us (AoU) Research Program. Factor-based PGS showed widespread, biologically coherent phenome-wide associations that were reproducible across biobanks and ancestries. Their predictive utility, however, was strongly disease dependent. For asthma, a respiratory factor PGS outperformed an internally derived disease-specific PGS and showed superior cross-ancestry portability, retaining 41.5% of European-ancestry predictive accuracy in African-ancestry individuals, compared with 22.9% for an asthma PGS derived from the largest available multi-ancestry GWAS. By contrast, disease-specific PGS remained superior for coronary artery disease (CAD) and type 2 diabetes (T2D). These findings suggest that phenome-derived aggregation is most beneficial when disease-specific GWAS incompletely capture underlying liability, including settings of biological heterogeneity or imprecise phenotyping. We then evaluated SDoH in AoU as a complementary axis shaping prevalent disease prediction beyond genetic susceptibility. 
Across all three diseases, SDoH contributed substantial and largely independent predictive information beyond the disease-optimal genetic model. SDoH also modified how genetic liability translated into observed disease prevalence: for asthma and CAD, genetic stratification attenuated with increasing social burden, whereas this attenuation was substantially weaker for T2D. As a result, the same genetic percentile corresponded to different standardized predicted prevalences across social strata, reflecting disease-specific shifts in baseline prevalence, genetic gradients and calibration. Together, these findings indicate that disease risk is shaped by both genetic liability and the social context in which that liability is realized. Phenome-derived PGS improve prediction under specific architectural conditions, whereas social context independently modifies the performance, calibration and interpretation of genetic risk across populations.

17
Evaluation of polygenic scoring methods in five biobanks reveals greater variability between biobanks than between methods and highlights benefits of ensemble learning

Monti, R.; Eick, L.; Hudjashov, G.; Läll, K.; Kanoni, S.; Wolford, B. N.; Wingfield, B.; Pain, O.; Wharrie, S.; Jermy, B.; McMahon, A.; Hartonen, T.; Heyne, H. O.; Mars, N.; Genes & Health Research Team, ; Hveem, K.; Inouye, M.; van Heel, D. A.; Mägi, R.; Marttinen, P.; Ripatti, S.; Ganna, A.; Lippert, C.

2023-11-20 genetic and genomic medicine 10.1101/2023.11.20.23298215 medRxiv
Top 0.1%
26.6%

Methods to estimate polygenic scores (PGS) from genome-wide association studies are increasingly utilized. However, independent method evaluation is lacking, and method comparisons are often limited. Here, we evaluate polygenic scores derived using seven methods in five biobank studies (totaling about 1.2 million participants) across 16 diseases and quantitative traits, building on a reference-standardized framework. We conducted meta-analyses to quantify the effects of method choice, hyperparameter tuning, method ensembling and target biobank on PGS performance. We found that no single method consistently outperformed all others. PGS effect sizes were more variable between biobanks than between methods within biobanks when methods were well-tuned. Differences between methods were largest for the two investigated autoimmune diseases, seropositive rheumatoid arthritis and type 1 diabetes. For most methods, cross-validation was more reliable for tuning hyperparameters than automatic tuning (without the use of target data). For a given target phenotype, elastic net models combining PGS across methods (ensemble PGS) tuned in the UK Biobank provided consistent, high, and cross-biobank transferable performance, increasing PGS effect sizes (β-coefficients) by a median of 5.0% relative to LDpred2 and MegaPRS (the two best performing single methods when tuned with cross-validation). Our interactively browsable online results (https://methodscomparison.intervenegeneticscores.org/) and open-source workflow prspipe (https://github.com/intervene-EU-H2020/prspipe) provide a rich resource and reference for the analysis of polygenic scoring methods across biobanks.

18
Identification of moderate effect size genes in autism spectrum disorder through a novel gene pairing approach

Caballero, M.; Satterstrom, F. K.; Buxbaum, J.; Mahjani, B.

2024-04-04 genetic and genomic medicine 10.1101/2024.04.03.24305278 medRxiv
Top 0.1%
25.9%

Autism Spectrum Disorder (ASD) arises from complex genetic and environmental factors, with inherited genetic variation playing a substantial role. This study introduces a novel approach to uncover moderate effect size (MES) genes in ASD, which individually do not meet the ASD liability threshold but collectively contribute when paired with specific other MES genes. Analyzing 10,795 families from the SPARK dataset, we identified 97 MES genes forming 50 significant gene pairs, demonstrating a substantial association with ASD when considered in tandem, but not individually. Our method leverages familial inheritance patterns and statistical analyses, refined by comparisons against control cohorts, to elucidate these gene pairs' contribution to ASD liability. Furthermore, expression profile analyses of these genes in brain tissues underscore their relevance to ASD pathology. This study underscores the complexity of ASD's genetic landscape, suggesting that gene combinations, beyond high-impact single-gene mutations, significantly contribute to the disorder's etiology and heterogeneity. Our findings pave the way for new avenues in understanding ASD's genetic underpinnings and developing targeted therapeutic strategies.

19
Using a modular massively parallel reporter assay to discover context-specific regulatory grammars in type 2 diabetes

Tovar, A.; Kyono, Y.; Nishino, K.; Bose, M.; Varshney, A.; Parker, S. C. J.; Kitzman, J. O.

2023-10-10 genomics 10.1101/2023.10.08.561391 medRxiv
Top 0.1%
25.7%

Most genome-wide association signals for complex disease reside in the noncoding genome, where defining function is nontrivial. MPRAs (massively parallel reporter assays) offer a scalable means to identify functional regulatory elements, but are typically conducted without regard to cell type, pairing cloned fragments with a generic housekeeping promoter. To explore the context-sensitivity of MPRAs, we screened enhancer activity across a panel of nearly 12,000 198-bp fragments spanning over 300 type 2 diabetes- and metabolic trait-associated regions in the 832/13 rat insulinoma beta cell line, a relevant model of pancreatic beta cells. We explored these fragments' context sensitivity by comparing their activities when placed up- or downstream of a reporter gene, and in combination with either a synthetic housekeeping promoter (SCP1) or a more biologically relevant promoter corresponding to the human insulin (INS) gene. We identified clear effects of MPRA construct design on enhancer activity. Specifically, a subset of fragments (n = 702/11,656) displayed positional bias, evenly distributed across up- and downstream preference. Promoter choice also influenced MPRA activity (n = 698/11,656), mostly biased towards the cell-specific INS promoter (73.4%). To identify sequence features associated with promoter preference, we used Lasso regression with 562 genomic annotations and discovered that fragments with INS promoter-biased activity are enriched for HNF1 motifs. HNF1 family transcription factors are key regulators of glucose metabolism disrupted in maturity-onset diabetes of the young (MODY), suggesting genetic convergence between rare coding variants that cause MODY and common T2D-associated regulatory regions.
We designed a follow-up MPRA containing HNF1 motif-enriched fragments and observed several instances where deletion or mutation of HNF1 motifs disrupted the INS promoter-biased enhancer activity, specifically in the beta cell model but not in a skeletal muscle cell line, another diabetes-relevant cell type. Together, our study suggests that cell-specific regulatory activity is partially influenced by enhancer-promoter compatibility and indicates that careful attention should be paid when designing MPRA libraries to capture context-specific regulatory processes at disease-associated genetic signals.

20
LAT encodes T cell activation pathway balance

Rubin, A. J.; Dao, T. T.; Schueppert, A. V.; Regev, A.; Shalek, A. K.

2024-08-26 genomics 10.1101/2024.08.26.609683 medRxiv
Top 0.1%
25.4%

Immune cells transduce environmental stimuli into responses essential for host health via complex signaling cascades. T cells, in particular, leverage their unique T cell receptors (TCRs) to detect specific Human Leukocyte Antigen (HLA)-presented peptides. TCR activation is then relayed via linker for activation of T cells (LAT), a TCR-proximal disordered adapter protein, which organizes protein partners and mediates the propagation of signals down diverse pathways including NFAT and AP-1. Here, we studied how balanced downstream pathway activation is encoded in the amino acid sequence of LAT. To comprehensively profile the sequence-function relationship of LAT, we developed a pooled, single-cell, high-content screening approach in which a large series of mutants in the LAT protein were analyzed to characterize their effects on T cell activation. Measuring epigenetic, transcriptomic, and cell surface protein dynamics of single cells harboring distinct LAT mutants, we found functional regions spanning over 40% of the LAT amino acid sequence. Conserved sequence motifs for protein interactions along with charge distribution are critical sequence features, and contribute to interpretation of human genetic variation in LAT. While mutant defect severity spans from moderate to complete loss of function, nearly all defective mutants, irrespective of their position in LAT, confer balanced defects across all downstream pathways. To understand the molecular basis for this observation, we performed proximal protein labeling which demonstrated that disruption of LAT interaction with a single partner protein indirectly disrupts other partner interactions, likely through the dual roles of these proteins as effectors of downstream pathways and bridging factors between LAT molecules. Overall, we report widely distributed functional regions throughout a disordered adapter and a precise physical organization of LAT and interacting molecules which constrains signaling outputs. 
More broadly, we describe an approach for interrogating sequence-function relationships for proteins with complex activities across regulatory layers of the cell.